Introduction¶

Bellabeat is a high-tech manufacturer of health-focused products for women.

Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market.

Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

Step One - ASK¶

Questions for the analysis¶

1. What are some trends in smart device usage?

2. How could these trends apply to Bellabeat customers?

3. How could these trends help influence Bellabeat's marketing strategy?

Business task¶

1. Identify potential opportunities for growth

2. Make recommendations for improving Bellabeat's marketing strategy based on trends in smart device usage.

Step Two - PREPARE¶

Data set = Fitbit Fitness Tracker Data made available by Mobius stored on Kaggle.

Legalities = This dataset is under a CC0: Public Domain license, i.e. the creator has waived their rights to the work under copyright law.

ROCCC¶

1. Reliability = LOW - data was collected from only 30 participants, with a number of unknowns such as age and gender.

2. Originality = LOW - this is third-party Fitbit Fitness Tracker Data, made available by Mobius on Kaggle and originally collected via Amazon Mechanical Turk.

3. Comprehensive = MEDIUM - multiple fields and pieces of information are available, but age and gender are not included, and only a total of 2 "Fat" logs and 67 weight logs were recorded. There is also no hydration logging, which is relevant given Bellabeat's interest in the marketability of their Spring product.

4. Current = LOW - this data set is now seven years old, and habits have changed significantly in that time, especially during and after the pandemic.

5. Cited = HIGH - the source is a well-regarded data collector and the data is well documented.

Data Selection¶

I will largely focus on daily device use to obtain an overall view, but I will also examine hourly usage to identify more specific trends.

Step Three - PROCESS¶

I will use RStudio as my primary tool: it can store and manage large datasets, work across all of them at once, and provides visualization tools within a single environment.

I will now set up my environment, then load and clean the data.

Each of my code chunks runs automatically; I have excluded some outputs due to their length.

Set-Up¶

Load Packages¶

First I will set up my environment by loading the necessary packages for this task.

In [1]:
library(tidyverse)
library(ggplot2)
library(readr)
library(here)
library(dplyr)
library(hms)
library(lubridate)
library(shiny)
library(tidyr)
library(plotly)
library(skimr)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
here() starts at /kaggle/working


Attaching package: ‘hms’


The following object is masked from ‘package:lubridate’:

    hms



Attaching package: ‘plotly’


The following object is masked from ‘package:ggplot2’:

    last_plot


The following object is masked from ‘package:stats’:

    filter


The following object is masked from ‘package:graphics’:

    layout


The following object is masked from ‘package:httr’:

    config


Working Directory¶

I will check my working directory is set correctly.

In [2]:
getwd()
'/kaggle/working'

Data Set Upload¶

Now I will load the required CSV files, naming the resulting data frames consistently.

In [3]:
raw_act_day <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
raw_cal_day <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv')
raw_int_day <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv')
raw_step_day <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv')
raw_sleep_day <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')

raw_cal_hour <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv')
raw_int_hour <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv')
raw_step_hour <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv')

raw_hr_sec <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv')
raw_weight <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')

raw_cal_min <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteCaloriesNarrow_merged.csv')
raw_int_min <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteIntensitiesNarrow_merged.csv')
raw_met_min <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteMETsNarrow_merged.csv')
raw_sleep_min <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv')
raw_step_min <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteStepsNarrow_merged.csv')

Data Set Cleaning¶

Now I will check all of the data for correctness and uniformity; I will reformat as appropriate.

In [4]:
# In order to check my data, I used the following code for each as appropriate; I will not execute them here due to the length of the result.

#str(raw_act_day)
#colnames(raw_act_day)
#rownames(raw_act_day)
#head(raw_act_day)
#skim_without_charts(raw_act_day)
#glimpse(raw_act_day)

# note that column 2 is string instead of date
act_day <- raw_act_day %>%
  mutate(ActivityDate = as.Date(ActivityDate, format = "%m/%d/%Y"))
# have now noted that date columns are named differently in each data frame; I 
# will rename for uniformity
act_day <- act_day %>%
  rename(Date = ActivityDate)
# now drop na
act_day <- na.omit(act_day)
# repeat process with all other data frames showing similarly

cal_day <- raw_cal_day %>%
  mutate(ActivityDay = as.Date(ActivityDay, format = "%m/%d/%Y")) %>%
   rename(Date = ActivityDay) %>%
  na.omit()

# on this one, I note that "ActivityHour" contains both date and time in
# character format; I will convert it, then separate it into two columns
cal_hour <- raw_cal_hour
# then
cal_hour$ActivityHour <-
  as.POSIXct(raw_cal_hour$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
# Separate ActivityHour into Date and Time columns
cal_hour <- cal_hour %>%
  separate(ActivityHour, into = c("Date", "Time"), sep = " ") %>%
    na.omit()
# repeat process with all other data frames showing similarly

cal_min <- raw_cal_min
# then
cal_min$ActivityMinute <-
  as.POSIXct(raw_cal_min$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")
# Separate ActivityMinute into Date and Time columns
cal_min <- cal_min %>%
  separate(ActivityMinute, into = c("Date", "Time"), sep = " ") %>%
  na.omit()

hr_sec <- raw_hr_sec
# then
hr_sec$Time <-
  as.POSIXct(raw_hr_sec$Time, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
hr_sec <- hr_sec %>%
  separate(Time, into = c("Date", "Time"), sep = " ") %>%
  na.omit()

int_day <- raw_int_day %>%
  mutate(ActivityDay = as.Date(ActivityDay, format = "%m/%d/%Y")) %>%
  rename(Date = ActivityDay) %>%
  na.omit()

int_hour <- raw_int_hour
# then
int_hour$ActivityHour <-
  as.POSIXct(raw_int_hour$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
int_hour <- int_hour %>%
  separate(ActivityHour, into = c("Date", "Time"), sep = " ") %>%
  na.omit()

int_min <- raw_int_min
# then
int_min$ActivityMinute <-
  as.POSIXct(raw_int_min$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
int_min <- int_min %>%
  separate(ActivityMinute, into = c("Date", "Time"), sep = " ") %>%
  na.omit()

met_min <- raw_met_min
# then
met_min$ActivityMinute <-
  as.POSIXct(raw_met_min$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
met_min <- met_min %>%
  separate(ActivityMinute, into = c("Date", "Time"), sep = " ") %>%
  na.omit()

# on this one, I note that "SleepDay" contains both date and time in
# character format; however, time appears to be irrelevant as every single
# entry is 12:00:00 AM, so I will separate it into two columns, reformat,
# then remove Time altogether
sleep_day <- raw_sleep_day
# then
sleep_day$SleepDay <-
  as.POSIXct(raw_sleep_day$SleepDay, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
sleep_day <- sleep_day %>%
  separate(SleepDay, into = c("Date", "Time"), sep = " ") %>%
  select(-Time)
# there's a lot of NA so we won't omit NA on this one

sleep_min <- raw_sleep_min
# then
sleep_min$date <-
  as.POSIXct(raw_sleep_min$date, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
sleep_min <- sleep_min %>%
  separate(date, into = c("Date", "Time"), sep = " ") %>%
  na.omit()

step_day <- raw_step_day %>%
  mutate(ActivityDay = as.Date(ActivityDay, format = "%m/%d/%Y")) %>%
  rename(Date = ActivityDay) %>%
  na.omit()

step_hour <- raw_step_hour
# then
step_hour$ActivityHour <-
  as.POSIXct(raw_step_hour$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
step_hour <- step_hour %>%
  separate(ActivityHour, into = c("Date", "Time"), sep = " ") %>%
  na.omit()
  
step_min <- raw_step_min
# then
step_min$ActivityMinute <-
  as.POSIXct(raw_step_min$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
step_min <- step_min %>%
  separate(ActivityMinute, into = c("Date", "Time"), sep = " ") %>%
  na.omit()

weight <- raw_weight
# then
weight$Date <-
  as.POSIXct(raw_weight$Date, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
weight <- weight %>%
  separate(Date, into = c("Date", "Time"), sep = " ") 
# there's a lot of NAs, so we won't omit NA on this one
Warning message:
“Expected 2 pieces. Missing pieces filled with `NA` in 413 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].”
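The per-frame steps above repeat the same convert-and-split pattern; they could be factored into a small helper. A sketch (`split_datetime` is a hypothetical name, assuming the shared "%m/%d/%Y %I:%M:%S %p" timestamp format and an English locale for AM/PM parsing); formatting the time explicitly also avoids the missing-pieces warning that `separate()` raises for midnight timestamps:

```r
# Sketch: convert a character timestamp column and split it into Date/Time.
# split_datetime() is a hypothetical helper, not part of the original analysis.
split_datetime <- function(df, col, drop_na = TRUE) {
  stamp <- as.POSIXct(df[[col]], format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
  df$Date <- as.Date(stamp)
  df$Time <- format(stamp, "%H:%M:%S")  # keeps "00:00:00" rather than dropping it
  df[[col]] <- NULL                     # drop the original combined column
  if (drop_na) df <- na.omit(df)
  df
}

# e.g. cal_hour <- split_datetime(raw_cal_hour, "ActivityHour")
```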

Further Cleaning and Transformation/Manipulation¶

Merge Data Frames¶

I checked each data frame's column names for compatibility and suitability for joint analysis using the following code (results hidden due to the length of this document):

In [5]:
#colnames(cal_day)
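The hidden checks can also be condensed: given a named list of frames, the columns they all share are the natural merge keys (a sketch; `common_cols` is a hypothetical helper):

```r
# Sketch: report the columns shared by every data frame in a list -
# these are the candidate merge keys (here, Id and Date).
common_cols <- function(frames) Reduce(intersect, lapply(frames, colnames))

# e.g. common_cols(list(cal_day, int_day, step_day))
```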

This allowed me to choose the most appropriate data frames to place together for analysis:

In [6]:
# List of data frames
list_merge_day <- list(cal_day, int_day, 
                   step_day)

# Use merge() to inner join the data frames on Id and Date
merge_day <- Reduce(function(x, y) merge(x, y, by = c('Id', 'Date')), 
                          list_merge_day)

# [act_day omitted as it essentially covers all of the above already]
# [sleep_day omitted due to its vastly different number of observations; it is
# still important to analyse, so I will create a separate data frame for it]


merge_act_sleep <- merge(act_day, sleep_day, by=c('Id', 'Date'))

# *******************************************

# List of data frames
list_hour <- list(cal_hour, int_hour, step_hour)

# Use left_join to combine data frames
merge_hour <- reduce(list_hour, left_join, by = c('Id', 'Date', 'Time'))

# *******************************************

# I then figured out a much easier way to do this #d'oh!

merge_cal_weight <- merge(act_day, weight, by=c('Id', 'Date'))

# ******************************************

merge_int_cal <- merge(int_day, cal_day, by=c('Id', 'Date'))

Mutate/Add Extra Columns to assist with analysis¶

In [7]:
# hours asleep is easier to comprehend than minutes asleep so I will add this
# column using an equation
merge_act_sleep$TotalHoursAsleep <- 
  merge_act_sleep$TotalMinutesAsleep / 60

# repeat as necessary below
# **********************************

merge_act_sleep$NonActiveMinutes <- 
  merge_act_sleep$LightlyActiveMinutes + merge_act_sleep$SedentaryMinutes

merge_act_sleep$NonActiveHours <- 
  (merge_act_sleep$LightlyActiveMinutes + merge_act_sleep$SedentaryMinutes) / 60

merge_act_sleep$ActiveMinutes <- 
  merge_act_sleep$VeryActiveMinutes + merge_act_sleep$FairlyActiveMinutes

merge_act_sleep$ActiveHours <- 
  (merge_act_sleep$VeryActiveMinutes + merge_act_sleep$FairlyActiveMinutes) / 60

#**********************************************

merge_day$NonActiveMinutes <- 
  merge_day$LightlyActiveMinutes + merge_day$SedentaryMinutes

merge_day$NonActiveHours <- 
  (merge_day$LightlyActiveMinutes + merge_day$SedentaryMinutes) / 60

merge_day$ActiveMinutes <- 
  merge_day$VeryActiveMinutes + merge_day$FairlyActiveMinutes

merge_day$ActiveHours <- 
  (merge_day$VeryActiveMinutes + merge_day$FairlyActiveMinutes) / 60

# **********************************************

# Filter out any Id with fewer than 6 weight data points,
# as sparse data reduces the ability to conduct meaningful analysis
merge_cal_weight_filtered <- merge_cal_weight %>%
  group_by(Id) %>%
  filter(n() >= 6) %>%
  ungroup()

# Convert Id to a factor with modified levels for labeling
merge_cal_weight_filtered$Id <- factor(merge_cal_weight_filtered$Id, levels = 
                                unique(merge_cal_weight_filtered$Id))

Step Four - ANALYZE¶

Initial Analysis¶

User IDs¶

The accompanying information states that this data comes from 30 consenting users, so I want to check how many unique users each data frame actually contains and store the counts as a data frame for reference.

In [8]:
# print list of all data frames in environment
# (commented off due to length)

# print(ls())

# List of data frames
list_all_dfs <- list(
  act_day = act_day,
  cal_day = cal_day,
  cal_hour = cal_hour,
  cal_min = cal_min,
  hr_sec = hr_sec,
  int_day = int_day,
  int_hour = int_hour,
  int_min = int_min,
  met_min = met_min,
  sleep_day = sleep_day,
  sleep_min = sleep_min,
  step_day = step_day,
  step_hour = step_hour,
  step_min = step_min,
  weight = weight
)

# Calculate the number of unique IDs in each data frame
count_unique_id <- sapply(list_all_dfs, function(df) n_distinct(df$Id))

# Create a data frame with the results
df_unique_ids <- data.frame(
  Data_Frame = names(count_unique_id),
  Unique_Count = count_unique_id
)

# Reset row names
rownames(df_unique_ids) <- NULL
In [9]:
# View results
View(df_unique_ids)

# Convert 'Data_Frame' to a factor
df_unique_ids$Data_Frame <- factor(df_unique_ids$Data_Frame, levels = 
                                     df_unique_ids$Data_Frame)

# Create a bar chart
chart_df_unique_ids <- ggplot(data = df_unique_ids, aes(x = Data_Frame, 
                                                        y = Unique_Count)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  labs(title = "Count of Unique IDs in Each Data Frame",
       x = "Data Frame",
       y = "Unique Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

chart_df_unique_ids

ggsave('chart_df_unique_ids.png',width=16,height=8)
A data.frame: 15 × 2
Data_Frame   Unique_Count
<chr>        <int>
act_day      33
cal_day      33
cal_hour     33
cal_min      33
hr_sec       14
int_day      33
int_hour     33
int_min      33
met_min      33
sleep_day    24
sleep_min    24
step_day     33
step_hour    33
step_min     33
weight        8

We can see that 33 IDs use the majority of the functions but the following functions are used less:

1. Sleep by minute and by day (24 IDs)

2. Heartrate per second (14 IDs)

3. Weight logging is the least used (8 IDs)

Comments¶

As only 8 persons have logged their weight, this could indicate that:

1. Users have a negative relationship with weight logging and the feelings that it evokes.

2. Users do not have scales

3. The logging process is difficult in the application

Recommendations¶

Therefore I suggest trying to positively inform customers about a healthy relationship with weight logging.

Other fitness applications have introduced smart scales, which can automatically send the data to the main application, which would assist the user in easy logging and also potentially open up another marketable range.

Comments¶

I note no mention of the use of any water logging data, which leads me to hypothesize that either:

1. This data is missing.

2. This fitness company does not offer an option for this, whereas Bellabeat does with "Spring", thus highlighting a potential gap in the market that Bellabeat can take advantage of.

3. The option exists but no one is availing of it, suggesting there is no market in the gap.

Recommendations¶

Either way, more research is needed in regard to the marketability for the logging of hydration throughout the day.

Date Ranges¶

The information on this data set states that the data provided ranges between 12th March and 12th May 2016; let's check.

In [10]:
range(act_day$Date)
range(cal_day$Date)
range(int_day$Date)
range(sleep_day$Date)
range(step_day$Date)
range(weight$Date)
  1. 2016-04-12
  2. 2016-05-12
  1. 2016-04-12
  2. 2016-05-12
  1. 2016-04-12
  2. 2016-05-12
  1. '2016-04-12'
  2. '2016-05-12'
  1. 2016-04-12
  2. 2016-05-12
  1. '2016-04-12'
  2. '2016-05-12'

All return 2016-04-12 to 2016-05-12, i.e. 12th April to 12th May 2016, so we have only one month of data as opposed to two; meaningful analysis may be further constrained.
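Given the single-month window, it may also be worth checking how many days each user actually logged. A sketch (`coverage_days` is a hypothetical helper, applied in the notebook to frames such as `act_day`):

```r
library(dplyr)

# Sketch: count distinct logged days per user, sorted so the sparsest
# users appear first.
coverage_days <- function(df) {
  df %>%
    group_by(Id) %>%
    summarise(DaysLogged = n_distinct(Date), .groups = "drop") %>%
    arrange(DaysLogged)
}

# e.g. coverage_days(act_day)
```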

Summaries¶

Let's summarize some of the data, to give us a quick idea of what we're looking at. I used colnames() to establish which column names to use for each summary.

In [11]:
summary_act <- act_day %>%
  mutate(SedentaryHours = SedentaryMinutes / 60) %>%
  select(TotalSteps,
         SedentaryHours,
         Calories) %>%
    summary()
print(summary_act)

# ************************************************

summary_sleep <- sleep_day %>%
  mutate(TotalHoursAsleep = TotalMinutesAsleep / 60) %>%
  select(TotalHoursAsleep) %>%
  summary()
print(summary_sleep)

# ************************************************

summary_weight <- weight %>%
  select(WeightKg,
         BMI,
         Fat) %>%
  summary()
print(summary_weight)

# ************************************************

summary_activity_minutes <- int_day %>%
  select(VeryActiveMinutes, 
         FairlyActiveMinutes, 
         LightlyActiveMinutes,
         SedentaryMinutes) %>%
  summary()

print(summary_activity_minutes)
   TotalSteps    SedentaryHours     Calories   
 Min.   :    0   Min.   : 0.00   Min.   :   0  
 1st Qu.: 3790   1st Qu.:12.16   1st Qu.:1828  
 Median : 7406   Median :17.62   Median :2134  
 Mean   : 7638   Mean   :16.52   Mean   :2304  
 3rd Qu.:10727   3rd Qu.:20.49   3rd Qu.:2793  
 Max.   :36019   Max.   :24.00   Max.   :4900  
 TotalHoursAsleep 
 Min.   : 0.9667  
 1st Qu.: 6.0167  
 Median : 7.2167  
 Mean   : 6.9911  
 3rd Qu.: 8.1667  
 Max.   :13.2667  
    WeightKg           BMI             Fat       
 Min.   : 52.60   Min.   :21.45   Min.   :22.00  
 1st Qu.: 61.40   1st Qu.:23.96   1st Qu.:22.75  
 Median : 62.50   Median :24.39   Median :23.50  
 Mean   : 72.04   Mean   :25.19   Mean   :23.50  
 3rd Qu.: 85.05   3rd Qu.:25.56   3rd Qu.:24.25  
 Max.   :133.50   Max.   :47.54   Max.   :25.00  
                                  NA's   :65     
 VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
 Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
 1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8  
 Median :  4.00    Median :  6.00      Median :199.0        Median :1057.5  
 Mean   : 21.16    Mean   : 13.56      Mean   :192.8        Mean   : 991.2  
 3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5  
 Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0  

Comments¶

The median amount of steps taken daily is 7,406 with the maximum at 36,019!

The median calories burned per day is 2,134 with the max at 4,900.

The median sedentary time per day is over 17.5 hours! Given that we sleep for 6-8 hours on average, that leaves roughly 9 further hours spent sedentary - could this be time spent at a desk job?

Recommendation¶

Add alert reminders to move during the workday.

Comments¶

The median hours asleep is just over 7 hours per day with the max at 13 hours, and the min at under an hour!

Recommendation¶

Add calm reminders for bed - maybe team up with "Calm" or similar.

Comments¶

The median weight is 62.5 kg, with median BMI at 24.39 and Fat at 23.5; healthy BMI is stated as being between 18.5-24.9. However, BMI is not always reliable for certain athletes, e.g. strongmen and bodybuilders.

The healthy body-fat range for men is 18-24% and for women 25-31%; while we do not know our users' genders, the average falls within a healthy range. However, only 2 "Fat" entries were made in total.

Recommendation¶

Provide more information and education about this to customers, and perhaps introduce a gadget into the Bellabeat range to help users calculate it.

Comments¶

  • Very Active Minutes, median = 4

  • Fairly Active Minutes, median = 6

  • Lightly Active Minutes, median = 199 (3.3 hours)

  • Sedentary Minutes, median = 1,057 (17.6 hours)

Whilst 17.6 hours seems excessive, the median sleep is just over 7 hours, and assuming an 8-hour office day, those two together account for over 15 of the hours.

Recommendation¶

Add occupation type (admin, labor, etc.) to assist with analysis and accountability for customers. Encourage movement alerts during workday.

Step Five - SHARE¶

I am going to continue my analysis, now using visualization alongside it as I go.

Analysis - Activity Throughout the Day¶

I'm going to analyse the activity throughout the day to try to establish the most commonly active parts of users' days.

In [12]:
# Reformat Date as date and Time as hms
merge_hour <- merge_hour %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"),
         Time = as.POSIXct(Time, format = "%H:%M:%S"))

# Get Mean of Total Intensity
merge_hour_int <- merge_hour %>%
  group_by(Time) %>%
  summarise(int_mean = mean(TotalIntensity))

# Create the bar chart
chart_int_day <-
  ggplot(data = merge_hour_int) +
  geom_bar(mapping = aes(x = Time, y = int_mean),
           stat = "identity", fill = "#40a8a8") +
  labs(title = "Mean Intensity Throughout the Day",
       x = "Time of Day",
       y = "Mean Intensity") +
  scale_x_datetime(breaks = scales::date_breaks("1 hour"),
                   labels = scales::date_format("%H:%M")) +
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "white"), 
    panel.background = element_rect(fill = "white"),
    text = element_text(color = "black"),
    axis.text.x = element_text(angle = 25, hjust = 1)
  )

chart_int_day

ggsave('chart_int_day.png',width=16,height=8)

From this, we can see that people appear to be most active between 4pm and 6pm, with 11am-1pm following closely behind.
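The peak windows read off the chart can also be extracted programmatically. A sketch (`top_slots` is a hypothetical helper, applied in the notebook as `top_slots(merge_hour_int, "Time", "int_mean")`):

```r
# Sketch: pick the n highest-value time slots from an hourly summary table.
# top_slots() is a hypothetical helper, not part of the original analysis.
top_slots <- function(df, time_col, value_col, n = 3) {
  df[order(-df[[value_col]]), c(time_col, value_col)][seq_len(n), ]
}

# e.g. top_slots(merge_hour_int, "Time", "int_mean")
```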

Visualization - Hours Asleep vs Calories Burnt¶

I want to see if there is a correlation between the amount of hours spent sleeping and the amount of calories burned.

The pink points show the data (Total Hours Asleep vs Calories Burned), the green line shows the linear model, and the blue line shows the LOESS (locally weighted scatterplot smoothing) curve.

In [13]:
chart_cal_sleep <-
  ggplot(data=merge_act_sleep) +
  geom_point(mapping=aes(x=Calories,y=TotalHoursAsleep), color='pink') +
  geom_smooth(mapping = aes(x = Calories, y = TotalHoursAsleep), color='green',
              method = 'lm') +
  geom_smooth(mapping = aes(x = Calories, y = TotalHoursAsleep), method = 
                'loess') +
  labs(title='Total Calories Burned vs Total Hours Asleep',
       caption='12th April-12th May 2016') +
  scale_x_continuous(name = "Calories Burned") +
  # adjust the scale as currently only shows "5" and "10"
  scale_y_continuous(
    name = "Total Hours Asleep",
    breaks = seq(0, max(merge_act_sleep$TotalHoursAsleep) + 2, by = 2),
    labels = seq(0, max(merge_act_sleep$TotalHoursAsleep) + 2, by = 2))

ggsave('chart_cal_sleep.png',width=16,height=8)

# Convert ggplot to interactive plotly
charti_cal_sleep <- ggplotly(chart_cal_sleep)

# Display the interactive plot
charti_cal_sleep

ggsave('charti_cal_sleep.png',width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

It does not appear there is a correlation between calories burned and total hours asleep; let's check by calculating the correlation coefficient between the two variables to quantitatively measure the strength and direction of the linear relationship.

In [14]:
cor_cal_sleep <- cor(merge_act_sleep$Calories, 
                               merge_act_sleep$TotalHoursAsleep)
cat("Correlation Coefficient:", cor_cal_sleep, "\n")
Correlation Coefficient: -0.02852571 

Correlation Coefficient: -0.02852571 - this indicates a correlation so weak that no meaningful conclusion can be drawn.
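With only a month of data, it can also help to check whether a coefficient is statistically distinguishable from zero; base R's cor.test() reports a confidence interval and p-value alongside the estimate. A sketch (`check_cor` is a hypothetical wrapper, applied here as `check_cor(merge_act_sleep$Calories, merge_act_sleep$TotalHoursAsleep)`):

```r
# Sketch: correlation plus a significance check (Pearson, base R).
# check_cor() is a hypothetical wrapper, not part of the original analysis.
check_cor <- function(x, y) {
  ct <- cor.test(x, y)
  cat("Correlation:", round(unname(ct$estimate), 4),
      "| 95% CI:", round(ct$conf.int[1], 4), "to", round(ct$conf.int[2], 4),
      "| p-value:", signif(ct$p.value, 3), "\n")
  invisible(ct)
}

# e.g. check_cor(merge_act_sleep$Calories, merge_act_sleep$TotalHoursAsleep)
```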

Since there does not appear to be a correlation between calories burned and total hours asleep, let's instead see whether hours asleep correlate with different activity levels.

Visualization - Non Active Hours vs Total Hours Asleep¶

I have highlighted 6-8 hours as this is advised as the ideal amount of sleep per night; additionally, the median sleep achieved by this group we saw was just over 7 hours.

In [15]:
chart_sleep_sed <- 
  ggplot(data = merge_act_sleep,
                              aes(x = NonActiveHours, y = TotalHoursAsleep)) +
  geom_point(color = 'pink') +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = 6, ymax = 8),
            fill = "yellow", alpha = 0.006) +  # More transparent highlight
  labs(title = 'Total Non Active Hours vs Total Hours Asleep',
       caption = '12th April - 12th May 2016',
       x = "Non Active Hours",
       y = "Total Hours Asleep") +
  scale_y_continuous(breaks = seq(0, max(merge_act_sleep$TotalHoursAsleep) + 2, by = 2)) +
  theme_minimal()

chart_sleep_sed

ggsave('chart_sleep_sed.png',width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

This appears to show negative correlation; let's check:

In [16]:
cor_nonactive_sleep <- cor(merge_act_sleep$NonActiveHours, 
                           merge_act_sleep$TotalHoursAsleep)
cat("Correlation Coefficient:", cor_nonactive_sleep, "\n")
Correlation Coefficient: -0.5825874 

Correlation Coefficient: -0.5825874 - this indicates a moderate negative correlation: more Non Active Hours correspond to fewer Total Hours Asleep.

Comments¶

It appears that those within the 6-8 hour sleep band largely record between 12.5 and 17.5 hours of non/low activity per day, leaving 6.5-11.5 hours of other time (sleep plus fairly/very active time).

If the non/low activity figure includes the roughly 8 hours spent sleeping, that leaves about 4.5-9.5 waking hours of non/low activity per day - consistent with desk jobs of 8 hours per day.

Let's see if a less sedentary day affects the amount of sleep obtained.

Visualization - Active Hours vs Total Hours Asleep¶

In [17]:
chart_sleep_active <- 
  ggplot(data = merge_act_sleep,
                          aes(x = ActiveHours, y = TotalHoursAsleep)) +
  geom_point(color = 'pink') +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = 6, ymax = 8),
            fill = "yellow", alpha = 0.006) +  # More transparent highlight
  labs(title = 'Total Active Hours vs Total Hours Asleep',
       caption = '12th April - 12th May 2016',
       x = "Active Hours",
       y = "Total Hours Asleep") +
  scale_y_continuous(breaks = seq(0, max(merge_act_sleep$TotalHoursAsleep) + 2, by = 2)) +
  theme_minimal()

chart_sleep_active

ggsave('chart_sleep_active.png',width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

This appears to show a very weak negative correlation; let's check:

In [18]:
cor_active_sleep <- cor(merge_act_sleep$ActiveHours, 
                        merge_act_sleep$TotalHoursAsleep)
cat("Correlation Coefficient:", cor_active_sleep, "\n")
Correlation Coefficient: -0.1812202 

Correlation Coefficient: -0.1812202 - this indicates a weak negative correlation so higher values of one weakly correspond to lower values of the other.

Comments¶

From looking at both of these graphs, the only conclusion I can draw is that more time spent lightly active/sedentary (Non Active) corresponds to fewer total hours of sleep.

Recommendations¶

Allow users to set a goal bedtime and send alerts, for example one hour before, to remind them to prepare for bed. A second alert 30 minutes beforehand could encourage them to begin calming activities that promote sleep.

Visualization - Calories vs Steps¶

I used ggplot2 and plotly to create an interactive graph showing the calories burned vs Total Steps taken in a day.

In [19]:
# I want a nice color palette for this:
palette_cal_step <- c("#1F78B4", "#33A02C", "#D95F02")

chart_cal_step <- ggplot(data = merge_day) +
  geom_point(mapping = aes(x = Calories, y = StepTotal, color = "Data Points"),
             size = 3, show.legend = TRUE) +
  stat_summary(mapping = aes(x = Calories, y = StepTotal, color = "Median"), 
               fun = median, geom = "point", size = 1, show.legend = TRUE) +
  geom_smooth(mapping = aes(x = Calories, y = StepTotal, color = 
                              "Smoothed Line"), method = 'lm', show.legend = TRUE) +
  labs(title = 'Total Calories Burned vs Total Steps',
       x = "Calories Burned",
       y = "Total Steps",
       color = "Legend") +
  scale_x_continuous(name = "Calories Burned") +
  scale_y_continuous(name = "Total Steps",
                     breaks = seq(0, max(merge_day$StepTotal) + 2000, 
                                  by = 2000),
                     labels = seq(0, max(merge_day$StepTotal) + 2000, 
                                  by = 2000)) +
  scale_color_manual(name = "Legend",
                     values = palette_cal_step,
                     labels = c("Steps/Cals", "Median", "Linear Model")) +
  theme(legend.position = "right")

ggsave('chart_cal_step.png',width=16,height=8)

# Customize legend labels in the interactive plot
charti_cal_step <- plot_ly(data = merge_day) %>%
  add_markers(x = ~Calories, y = ~StepTotal, color = I(palette_cal_step[1]),
              size = 3, name = "Steps/Cals",
              text = ~paste("Calories:", Calories, "<br>Steps:", StepTotal)) %>%
  add_markers(x = ~Calories, y = ~StepTotal, color = I(palette_cal_step[2]),
              size = 1, name = "Median",
              text = ~paste("Calories:", Calories, "<br>Steps:", StepTotal)) %>%
  add_trace(x = ~Calories, y = ~predict(lm(StepTotal ~ Calories)), 
            mode = "lines", color = I(palette_cal_step[3]),
            name = "Linear Model",
            text = ~paste("Calories:", Calories, "<br>Steps:", 
                          round(predict(lm(StepTotal ~ Calories)), 2))) %>%
  layout(title = 'Total Calories Burned vs Total Steps',
         xaxis = list(title = "Calories Burned"),
         yaxis = list(title = "Total Steps"),
         legend = list(title = list(text = "Legend"),
                       itemsizing = "constant"))

# Display the interactive plot
charti_cal_step

# ggsave() only captures ggplot objects; save the interactive plot as HTML
htmlwidgets::saveWidget(charti_cal_step, "charti_cal_step.html")
`geom_smooth()` using formula = 'y ~ x'
No trace type specified:
  Based on info supplied, a 'scatter' trace seems appropriate.
  Read more about this trace type -> https://plotly.com/r/reference/#scatter

This appears to show a positive correlation; let's check:

In [20]:
cor_cal_step <- cor(merge_day$Calories, 
                    merge_day$StepTotal)
cat("Correlation Coefficient:", cor_cal_step, "\n")
Correlation Coefficient: 0.5915681 

Correlation Coefficient: 0.5915681 - this indicates a moderate positive correlation between steps taken and calories burned: higher daily step counts tend to accompany higher calorie burn.
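As an aside, `cor()` alone gives no indication of statistical significance; base R's `cor.test()` also reports a p-value and confidence interval. A minimal sketch on simulated data (the toy data frame below is an assumption standing in for `merge_day`, which is not reproduced here):

```r
# Minimal sketch: significance test for a correlation coefficient.
# The toy data frame is simulated and stands in for merge_day (assumption).
set.seed(42)
toy <- data.frame(Calories = rnorm(100, mean = 2300, sd = 500))
toy$StepTotal <- 3 * toy$Calories + rnorm(100, mean = 0, sd = 2000)

ct <- cor.test(toy$Calories, toy$StepTotal)
cat("r =", round(unname(ct$estimate), 3),
    "p-value =", signif(ct$p.value, 3), "\n")
```

On the real data this would be `cor.test(merge_day$Calories, merge_day$StepTotal)`, which also returns a 95% confidence interval for the coefficient.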

Visualization - Steps and Calories vs Weight¶

I analysed the weight (Kg) fluctuation over a month for two users, noting the steps taken in the first visualization and the calories burnt in the second.

In [21]:
chart_step_weight <- 
  ggplot(data = merge_cal_weight_filtered) +
  geom_point(mapping = aes(x = Date, y = WeightKg, color = TotalSteps, 
                           group = Id)) +
  geom_line(mapping = aes(x = Date, y = WeightKg, color = TotalSteps, 
                          group = Id)) +
  scale_color_gradient(low = "red", high = "green") +
  labs(title = "Weight vs Total Steps by Two Users over 1 Month",
       caption = '12th April to 12th May 2016',
       color = 'Total Steps Taken') +
  facet_wrap(~ Id, scales = "free_y")

chart_step_weight

ggsave('chart_step_weight.png',width=16,height=8)

# **********************************************

chart_cal_weight <- 
ggplot(data = merge_cal_weight_filtered) +
  geom_point(mapping = aes(x = Date, y = WeightKg, color = Calories, 
                           group = Id)) +
  geom_line(mapping = aes(x = Date, y = WeightKg, color = Calories, 
                          group = Id)) +
  scale_color_gradient(low = "red", high = "green") +
  labs(title = "Weight vs Calories Burnt by Two Users over 1 Month",
       caption = '12th April to 12th May 2016',
       color = 'Calories Burnt') +
  facet_wrap(~ Id, scales = "free_y")

chart_cal_weight

ggsave('chart_cal_weight.png',width=16,height=8)

Comment¶

I note that subject two appears to have burnt more calories and taken more steps than subject one, and that subject one appears to have lost less weight than subject two - let's check:

In [22]:
# log initial weight of both users
initial_weight <- merge_cal_weight_filtered %>%
  filter(Date == as.Date("2016-04-12")) %>%
  pull(WeightKg)

# log final weight of both users
final_weight <- merge_cal_weight_filtered %>%
  filter(Date == as.Date("2016-05-12")) %>%
  pull(WeightKg)

# find percentage weight lost of both users
percentage_weight_loss <- 
  ((initial_weight - final_weight) / initial_weight) * 100

print(percentage_weight_loss)
[1] 0.9599976 2.0979056

Comment¶

Subject one lost under 1% of body weight while subject two lost over 2%; however, we must bear in mind that we know neither the goals nor the calorie intake of each subject.
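Note that the two `pull()` calls above rely on the two users appearing in the same row order on both dates. A `group_by()` approach keys the calculation to `Id` instead and is more robust; a sketch on toy values (an assumption standing in for `merge_cal_weight_filtered`):

```r
library(dplyr)

# Toy stand-in for merge_cal_weight_filtered (assumption): two users,
# one weight reading on the first and last day of the window.
toy_weight <- data.frame(
  Id       = c(1, 1, 2, 2),
  Date     = as.Date(c("2016-04-12", "2016-05-12",
                       "2016-04-12", "2016-05-12")),
  WeightKg = c(75.0, 74.28, 90.0, 88.11)
)

# Per-user percentage change, keyed by Id rather than row order
weight_change <- toy_weight %>%
  group_by(Id) %>%
  summarise(pct_loss = (first(WeightKg, order_by = Date) -
                        last(WeightKg, order_by = Date)) /
                       first(WeightKg, order_by = Date) * 100)

print(weight_change)
```

The toy weights were chosen so the output (~0.96% and ~2.10%) mirrors the result above.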

Recommendation¶

Add a calorie intake tracker and goal tracker to the application.

Visualisation - Steps vs Non Active Hours¶

In [23]:
chart_step_sedentary <- 
  ggplot(data = merge_day) +
  geom_point(mapping=aes(x = StepTotal, y = NonActiveHours)) +
  geom_smooth(mapping=aes(x = StepTotal, y = NonActiveHours), method = 'lm') +
  #facet_wrap(~???)
    labs(title = "Non Active Hours vs Steps per Day",
       x = "Total Daily Steps",
       y = "Non Active Hours",
       caption='12th April to 12th May 2016')

chart_step_sedentary

ggsave('chart_step_sedentary.png',width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'

This appears to show a minor negative correlation; let's check:

In [24]:
cor_step_sedentary <- cor(merge_day$StepTotal, 
                          merge_day$NonActiveHours)
cat("Correlation Coefficient:", cor_step_sedentary, "\n")
Correlation Coefficient: -0.1341473 

Correlation Coefficient: -0.1341473 - this indicates a weak negative correlation, i.e. little to no linear relationship between the two variables.

Comment¶

This could be because being active does not necessarily require many steps - e.g. weightlifting.

Also to be borne in mind is that a user could be taking a vast amount of steps, but at an extremely leisurely pace and thus the Activity would fall into Sedentary or Lightly Active, which has been counted together as "Non Active" for the purposes of this analysis.
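For clarity, "Non Active" here pools the Sedentary and Lightly Active minutes into hours. A sketch of how such a column can be derived (assumed to mirror the earlier preparation of `merge_day`; the toy rows are illustrative only):

```r
library(dplyr)

# Sketch: derive a NonActiveHours column by pooling Sedentary and
# Lightly Active minutes (assumption about how merge_day was prepared).
toy_day <- data.frame(
  SedentaryMinutes     = c(1200, 600),
  LightlyActiveMinutes = c(120, 300)
)

toy_day <- toy_day %>%
  mutate(NonActiveHours = (SedentaryMinutes + LightlyActiveMinutes) / 60)

print(toy_day$NonActiveHours)  # 22 and 15 hours
```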

Visualisation - Steps vs Active Hours¶

Let's look at Active Hours vs Total Daily Steps

In [25]:
chart_step_active <-
  ggplot(data = merge_day) +
  geom_point(mapping=aes(x = StepTotal, y = ActiveHours)) +
  geom_smooth(mapping=aes(x = StepTotal, y = ActiveHours), method = 'lm') +
  #facet_wrap(~???)
  labs(title = "Active Hours vs Steps per Day",
       x = "Total Daily Steps",
       y = "Active Hours",
       caption='12th April to 12th May 2016') +
  annotate('text', x=10000,y=5,label='Those who have more Active
                Minutes generally complete more steps',
           color='navy')

chart_step_active

ggsave('chart_step_active.png',width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'

This appears to show a positive correlation; let's check:

In [26]:
cor_step_active <- cor(merge_day$StepTotal, 
                          merge_day$ActiveHours)
cat("Correlation Coefficient:", cor_step_active, "\n")
Correlation Coefficient: 0.7335519 

Correlation Coefficient: 0.7335519 - this indicates a moderately strong positive correlation: higher values of one generally correspond to higher values of the other.

Comment¶

From this it would seem that sedentary/lightly active time has little bearing on daily step count, whereas fairly/very active time has a much greater bearing, which makes sense.

Activity Type¶

I have touched on the hypothesis that activity does not necessarily depend on steps taken, and I note that there is no indication of which activity a person is doing when classed in the higher activity levels. Let's explore further and see if we can infer anything in regard to this.

Let's see how much distance is covered per day whilst being classed as "Very Active".

In [27]:
range(int_day$VeryActiveDistance)

summary_vact_dist <- int_day %>%
  select(VeryActiveDistance) %>%
  summary()

print(summary_vact_dist)
[1]  0.00 21.92
 VeryActiveDistance
 Min.   : 0.000    
 1st Qu.: 0.000    
 Median : 0.210    
 Mean   : 1.503    
 3rd Qu.: 2.053    
 Max.   :21.920    

Comment¶

Very Active distance ranges from 0 to 21.92. The median of 0.210 is vastly different from the maximum of 21.920 - let's investigate.

In [28]:
# Count total number of entries in the month
nrow(int_day)
# 940

# count number of entries above 0 for Very Active
int_day %>%
  filter(VeryActiveDistance > 0) %>%
  nrow()
# 527 - so 413 entries (44%) recorded no VeryActiveDistance within the month

# count number of entries above 0 for Moderately Active
int_day %>%
  filter(ModeratelyActiveDistance > 0) %>%
  nrow()
# 554 - so 386 entries (41%) recorded no ModeratelyActiveDistance within the
# month

# count number of entries above 0.210 (median)
int_day %>%
  filter(VeryActiveDistance > 0.210) %>%
  nrow()
# 469

# Count data above 10
int_day %>%
  filter(VeryActiveDistance > 10) %>%
  nrow()
# 24

# print the top 3 high distances
int_day %>%
  top_n(3, VeryActiveDistance)
# 21.92, 21.66, 13.4
940
527
554
469
24
A data.frame: 3 × 10

        Id       Date SedentaryMinutes LightlyActiveMinutes FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
1624580081 2016-05-01             1020                  171                  63               186                    0.02                1.91                     4.19              21.92
8053475328 2016-05-08             1073                  228                  14               125                    0.00                3.66                     0.59              13.40
8877689391 2016-04-30             1089                  223                   4               124                    0.00                4.93                     0.08              21.66

Inferred Activity¶

Can we infer the type of activity performed by using a threshold?

I will set a threshold on "VeryActiveDistance" to possibly differentiate between stationary and movement-based activities - I have set it just under 1 (0.99) to allow for warm up/cool down and general movement around a gym, for example.

In [29]:
# Set threshold
distance_threshold <- 0.99

# Create a new column indicating the inferred activity type
int_day$InferredActivity <- 
  ifelse(int_day$VeryActiveDistance < distance_threshold, "Stationary", 
         "Movement")

# Count how many users are classed as participating in each inferred activity
int_day %>%
  filter(VeryActiveDistance > 0) %>%
  group_by(InferredActivity) %>%
  summarise(count = n())
A tibble: 2 × 2

InferredActivity count
Movement           343
Stationary         184

Recommendation¶

We could consider sending a request/alert to a user whilst they are being classified as "Very" or "Fairly/Moderately" Active, requesting that they declare what type of activity they are doing at that time.

Let's visualize the above investigation onto a graph.

Visualization - Active Minutes vs Active Distance¶

In [30]:
# Create a scatter plot

chart_vact_min_dist <- 
  ggplot(data = int_day %>%
                                filter(VeryActiveMinutes > 0), 
                              aes(x = VeryActiveMinutes, y = VeryActiveDistance)) +
  geom_point(aes(color = InferredActivity), size = 3) +
  geom_smooth(color = 'skyblue', linewidth = 1, method = 'lm') +
  labs(title = "Very Active Minutes vs Very Active Distance",
       x = "Very Active Minutes",
       y = "Very Active Distance",
       color = "Inferred Activity") +
  scale_color_manual(values = c("Stationary" = "red", "Movement" = "darkgreen")) +
  theme_minimal()

chart_vact_min_dist

ggsave('chart_vact_min_dist.png', width=16, height=8)

# Let's check out a density plot with the data

chart_dens_vact_min_dist <- 
  ggplot(data = int_day %>%
         filter(VeryActiveMinutes > 0),
       aes(x = VeryActiveDistance)) +
  geom_density(fill = "blue", alpha = 0.2) +
  labs(title = "Density Plot of Very Active Distance",
       x = "Very Active Distance",
       y = "Density") +
  theme_minimal()

chart_dens_vact_min_dist

ggsave('chart_dens_vact_min_dist.png', width=16, height=8)
`geom_smooth()` using formula = 'y ~ x'

Comment¶

The Scatterplot appears to show a moderately strong positive correlation between the time spent as "Very Active" per day and the Distance completed whilst being classified as "Very Active". However, there are a number of outliers.

With the density plot, we can easily see that the majority of the data lies between 0 and 5, and it also shows the full range; we can confirm this:

In [31]:
# Count data between distance of 0 and 5
int_day %>%
  filter(VeryActiveDistance >= 0, VeryActiveDistance <= 5) %>%
  nrow()
# 865 out of 940 entries fall between 0 and 5, as shown in the density plot.
865

Let's see how calories sit in this mix.

In [32]:
chart_vact_cal_dist <- 
  ggplot(data = merge_int_cal %>%
                                filter(VeryActiveMinutes > 0)) +
  geom_point(mapping=aes(x = VeryActiveDistance, y = Calories)) +
  geom_smooth(mapping=aes(x = VeryActiveDistance, y = Calories), method='lm') +
  labs(title = "Very Active Distance vs Calories",
       x = "Very Active Distance",
       y = "Calories") +
  theme_minimal()

chart_vact_cal_dist

ggsave('chart_vact_cal_dist.png', width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'

There does appear to be correlation between calories burnt and the distance traveled, let's check:

In [33]:
cor_vact_cal_dist <- cor(merge_int_cal$VeryActiveDistance, 
                         merge_int_cal$Calories)
cat("Correlation Coefficient:", cor_vact_cal_dist, "\n")
Correlation Coefficient: 0.4919586 

Correlation Coefficient: 0.4919586 - this indicates a moderate positive correlation: as one value increases, so (moderately) does the other.

Let's plot calories against Very Active Minutes.

In [34]:
chart_vact_cal_min <- 
  ggplot(data = merge_int_cal %>%
                                filter(VeryActiveMinutes > 0)) +
  geom_point(mapping=aes(x = VeryActiveMinutes, y = Calories)) +
  geom_smooth(mapping=aes(x = VeryActiveMinutes, y = Calories), method='lm') +
  labs(title = "Very Active Minutes vs Calories",
       x = "Very Active Minutes",
       y = "Calories") +
  theme_minimal()

chart_vact_cal_min

ggsave('chart_vact_cal_min.png', width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'

There does appear to be correlation between calories burnt and the minutes spent as very active, let's check:

In [35]:
cor_vact_cal_min <- cor(merge_int_cal$VeryActiveMinutes, 
                         merge_int_cal$Calories)
cat("Correlation Coefficient:", cor_vact_cal_min, "\n")
Correlation Coefficient: 0.6158383 

Correlation Coefficient: 0.6158383 - this indicates a stronger positive correlation than with distance. From this we can hypothesize that increasing time spent as "Very Active" matters more than increasing distance traveled; e.g. high intensity weight lifting may burn more calories than long distance running or cycling.
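One way to probe this hypothesis further is to put both predictors in a single linear model and compare their contributions directly. A hedged sketch on simulated data (the toy frame and its coefficients are assumptions standing in for `merge_int_cal`):

```r
# Sketch: compare minutes vs distance as joint predictors of calories.
# Simulated data (assumption) stands in for merge_int_cal; the correlated
# predictors mimic the fact that distance accrues with active minutes.
set.seed(1)
n <- 200
toy <- data.frame(VeryActiveMinutes = rpois(n, lambda = 30))
toy$VeryActiveDistance <- 0.08 * toy$VeryActiveMinutes + rexp(n, rate = 1)
toy$Calories <- 1500 + 25 * toy$VeryActiveMinutes +
  40 * toy$VeryActiveDistance + rnorm(n, mean = 0, sd = 200)

fit <- lm(Calories ~ VeryActiveMinutes + VeryActiveDistance, data = toy)
print(coef(summary(fit)))  # compare the t-values of the two predictors
```

On the real data, the relative size of the two t-values would indicate which predictor carries more independent information about calorie burn.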

Recommendations¶

Provide information to customers regarding this finding.

Provide weight lifting routines for customers.

Comment¶

I note outliers such as one showing VeryActiveMinutes below 25 but almost 5,000 calories burnt - perhaps they are conducting less intensive exercise in higher quantities, i.e. Fairly Active Minutes rather than Very Active Minutes. Let's take a quick look at that:

In [36]:
chart_mact_cal_min <- 
  ggplot(data = merge_int_cal %>%
                               filter(FairlyActiveMinutes > 0)) +
  geom_point(mapping=aes(x = FairlyActiveMinutes, y = Calories)) +
  geom_smooth(mapping=aes(x = FairlyActiveMinutes, y = Calories), method='lm') +
  labs(title = "Fairly Active Minutes vs Calories",
       x = "Fairly Active Minutes",
       y = "Calories") +
  theme_minimal()

chart_mact_cal_min

ggsave('chart_mact_cal_min.png', width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'

There is a noticeable difference here in the correlation, let's check:

In [37]:
cor_mact_cal_min <- cor(merge_int_cal$FairlyActiveMinutes, 
                        merge_int_cal$Calories)
cat("Correlation Coefficient:", cor_mact_cal_min, "\n")
Correlation Coefficient: 0.2976235 

Correlation Coefficient: 0.2976235 - this indicates a weak positive correlation: as one value increases, the other also increases, but weakly. This suggests that higher intensity exercise is much more closely related to calorie burn.

Comment¶

However, I note that same outlier as before near the 5,000 calorie mark but very low in Very and Fairly Active Minutes, let's check Lightly Active Minutes.

In [38]:
chart_lact_cal_min <- 
  ggplot(data = merge_int_cal %>%
                               filter(LightlyActiveMinutes > 0)) +
  geom_point(mapping=aes(x = LightlyActiveMinutes, y = Calories)) +
  geom_smooth(mapping=aes(x = LightlyActiveMinutes, y = Calories), method='lm') +
  labs(title = "Lightly Active Minutes vs Calories",
       x = "Lightly Active Minutes",
       y = "Calories") +
  theme_minimal()

chart_lact_cal_min

ggsave('chart_lact_cal_min.png', width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'

The correlation here is similar to that of Fairly Active, let's check:

In [39]:
cor_lact_cal_min <- cor(merge_int_cal$LightlyActiveMinutes, 
                        merge_int_cal$Calories)
cat("Correlation Coefficient:", cor_lact_cal_min, "\n")
Correlation Coefficient: 0.2867175 

Correlation Coefficient: 0.2867175 - again a weak positive correlation, similar to Fairly Active Minutes, reinforcing that higher intensity exercise is much more closely related to calorie burn.

Comment¶

We can see that this outlier near the 5,000 calorie mark is around 300 Lightly Active Minutes, let's examine them and see where they are burning their high level of calories versus a low calorie burning outlier:

In [40]:
# Filter data for the five highest Calorie burners
outlier_high <- merge_int_cal %>%
  top_n(5, wt = Calories) %>%
  mutate(Group = "High Calorie Burn")

# Filter data for the five lowest Calorie burners
outlier_low <- merge_int_cal %>%
  top_n(-5, wt = Calories) %>%
  mutate(Group = "Low Calorie Burn")

# Combine the dataframes into one called "outliers"
outliers <- bind_rows(outlier_high, outlier_low) %>%
  select(Id, Date, Group, everything())

# Let's plot it

chart_cal_sed <- 
  ggplot(outliers, aes(x = Group, y = SedentaryMinutes, fill = Group)) +
  geom_violin() +
  labs(title = "Comparison of Sedentary Minutes between High and Low 
       Calorie Burners",
       x = "Group",
       y = "Sedentary Minutes",
       fill = "Calorie Burn Group") +  # Add the fill scale label
  scale_fill_manual(values = c("High Calorie Burn" = "darkgreen", 
                               "Low Calorie Burn" = "red")) +
  theme_minimal()

chart_cal_sed

ggsave('chart_cal_sed.png', width=16,height=8)

For the low calorie burning group, it is clear that a significant portion shares similar levels of sedentary behaviour, reaching almost 1,500 minutes; this suggests a consistent lifestyle/routine.

The high calorie burn group shows a wider distribution of sedentary minutes; it would appear that the majority have between 500-650 with a smaller proportion having up to 900 sedentary minutes.
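The visual reading above can be checked numerically with per-group summary statistics. A sketch using toy values (an assumption standing in for the `outliers` frame built earlier):

```r
library(dplyr)

# Numeric check of the violin-plot reading: per-group median and spread
# of sedentary minutes. Toy values (assumption) stand in for `outliers`.
toy_outliers <- data.frame(
  Group            = rep(c("High Calorie Burn", "Low Calorie Burn"),
                         each = 5),
  SedentaryMinutes = c(520, 560, 610, 640, 900,
                       1420, 1440, 1450, 1460, 1470)
)

toy_outliers %>%
  group_by(Group) %>%
  summarise(median_sed = median(SedentaryMinutes),
            iqr_sed    = IQR(SedentaryMinutes)) %>%
  print()
```

A larger IQR for the high-burn group would confirm the wider distribution seen in the violin plot.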

Step Six - CONCLUSIONS¶

The data presents a number of difficulties: the number of participants is low, and it appears that not all features/applications were utilized, so not all data was available.

Further research should be conducted regarding the logging of hydration throughout the day, as we have no information on that.

There are other marketable opportunities to explore in the Bellabeat range, such as partnering with "Calm" or a similar app to encourage better sleep patterns.

Other recommendations include adding Smart Scales to the Bellabeat range, encouraging a more positive relationship with the logging of weight, and providing more informative and engaging content to help our customers achieve their goals and gain strength and confidence from achieving them independently.